
    Replacing 6T SRAMs with 3T1D DRAMs in the L1 data cache to combat process variability

    With continued technology scaling, process variations will be especially detrimental to six-transistor static memory structures (6T SRAMs). A memory architecture using three-transistor, one-diode DRAM (3T1D) cells in the L1 data cache tolerates wide process variations with little performance degradation, making it a promising choice for on-chip cache structures in next-generation microprocessors.

    Weightless: Lossy Weight Encoding For Deep Neural Network Compression

    The large memory requirements of deep neural networks limit their deployment and adoption on many devices. Model compression methods effectively reduce the memory requirements of these models, usually by applying transformations such as weight pruning or quantization. In this paper, we present a novel scheme for lossy weight encoding which complements conventional compression techniques. The encoding is based on the Bloomier filter, a probabilistic data structure that can save space at the cost of introducing random errors. By leveraging the ability of neural networks to tolerate these imperfections and re-training around the errors, the proposed technique, Weightless, can compress DNN weights by up to 496x with the same model accuracy. This results in up to a 1.51x improvement over the state of the art.
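
    As a concrete illustration, the sketch below builds a minimal exact Bloomier-filter-style table using the standard XOR/peeling construction. It is a hedged Python sketch, not the paper's implementation: the hash scheme, the choice of k=3 slots per key, the 16-bit mask width, and all function names are assumptions. Weightless additionally exploits the filter's lossy behavior (queries for keys outside the stored set return arbitrary values) and re-trains the network around the resulting errors.

        import random

        def _slots_and_mask(key, m, k, seed):
            # Derive k distinct table slots and a per-key mask (illustrative
            # hashing; a real implementation would use stronger hash functions).
            rnd = random.Random(hash((key, seed)))
            idxs = rnd.sample(range(m), k)
            mask = rnd.getrandbits(16)
            return idxs, mask

        def build(pairs, m, k=3, seed=0):
            """Encode {key: value} into a table of m ints; None if peeling fails."""
            key_idxs = {key: _slots_and_mask(key, m, k, seed)[0] for key, _ in pairs}
            touching = {i: set() for i in range(m)}   # slot -> keys hashing there
            for key, idxs in key_idxs.items():
                for i in idxs:
                    touching[i].add(key)
            # Peel keys that own a slot no other remaining key touches.
            stack = []
            candidates = [i for i in range(m) if len(touching[i]) == 1]
            while candidates:
                i = candidates.pop()
                if len(touching[i]) != 1:
                    continue                          # stale candidate
                (key,) = touching[i]
                stack.append((key, i))
                for j in key_idxs[key]:
                    touching[j].discard(key)
                    if len(touching[j]) == 1:
                        candidates.append(j)
            if len(stack) != len(pairs):
                return None                           # caller retries with a new seed
            # Assign in reverse peel order so each key's home slot stays final.
            table, values = [0] * m, dict(pairs)
            for key, home in reversed(stack):
                idxs, mask = _slots_and_mask(key, m, k, seed)
                acc = values[key] ^ mask
                for j in idxs:
                    if j != home:
                        acc ^= table[j]
                table[home] = acc
            return table

        def query(table, key, k=3, seed=0):
            idxs, mask = _slots_and_mask(key, len(table), k, seed)
            acc = mask
            for j in idxs:
                acc ^= table[j]
            return acc

        weights = list(enumerate([3, 7, 1, 0, 5]))    # weight index -> quantized value
        for seed in range(100):                       # retry until peeling succeeds
            table = build(weights, m=8, seed=seed)
            if table is not None:
                break
        assert all(query(table, i, seed=seed) == v for i, v in weights)

    Note the space saving: five stored values occupy only eight small table entries, and the structure degrades gracefully, which is the property the paper's re-training step relies on.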

    S³: Increasing GPU Utilization during Generative Inference for Higher Throughput

    Generating text with a large language model (LLM) consumes massive amounts of memory. Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow even larger than the model itself. This problem is exacerbated in one current LLM serving framework, which reserves memory for the maximum sequence length in the KV cache to guarantee that a complete sequence can be generated, since the output sequence length is not known in advance. This restricts serving to smaller batch sizes, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence length can mitigate this problem. To this end, we propose S³, which predicts the output sequence length, schedules generation queries based on the prediction to increase device resource utilization and throughput, and handles mispredictions. Our proposed method achieves 6.49× the throughput of systems that assume the worst case for the output sequence length.
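
    To see why worst-case reservation hurts, the back-of-the-envelope sketch below compares worst-case KV-cache reservation with the kind of length-aware packing S³ enables. All numbers (model shape, memory budget, predicted lengths) are illustrative assumptions, not figures from the paper, and a real scheduler must also handle sequences that outgrow their prediction, e.g. by preempting and requeueing them.

        # Assumed numbers: a 13B-class fp16 model with 40 layers and hidden
        # size 5120 needs about 2 (K and V) * layers * hidden * 2 bytes
        # ~= 0.8 MB of KV cache per token.
        LAYERS, HIDDEN, BYTES = 40, 5120, 2
        KV_PER_TOKEN = 2 * LAYERS * HIDDEN * BYTES    # bytes per token
        MAX_LEN = 2048                                # maximum sequence length
        BUDGET = 24 * 2**30                           # 24 GiB left for the KV cache

        # Worst-case reservation: every sequence is assumed to reach MAX_LEN.
        worst_case_batch = BUDGET // (KV_PER_TOKEN * MAX_LEN)

        # Length-aware packing: reserve only each query's predicted length.
        # Hypothetical predicted lengths for 64 queued generation queries:
        predicted = sorted([180, 90, 400, 60, 250, 120, 300, 80] * 8)
        packed, used = 0, 0
        for n in predicted:
            if used + KV_PER_TOKEN * n > BUDGET:
                break
            used += KV_PER_TOKEN * n
            packed += 1

        print(f"worst-case batch: {worst_case_batch}, length-aware batch: {packed}")
        # -> worst-case batch: 15, length-aware batch: 64 (with these assumptions)

    With these assumed numbers, predicting lengths lets roughly four times as many sequences share the same memory, which is exactly the utilization gap the paper targets.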

    PulseNet - A Parallel Flash Sampler and Digital Processor IC for Optical SETI

    PulseNet is a full-custom IC with a parallel flash ADC and digital processing that enables an all-sky optical search for extraterrestrial intelligence. It integrates 448 sense amplifiers that digitize 32 analog signals at 1 GS/s, and other circuits that filter samples, store candidate signals, and support astronomical observations. Its ~250,000 CMOS transistors (TSMC 0.25µm) dissipate 1.1W at 400MHz and 2.5V.
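
    For intuition about the sampler, here is a toy model of one flash-ADC channel as a thermometer-coded comparator bank; 448 sense amplifiers across 32 channels works out to 14 reference levels per channel. The threshold spacing and input scale below are illustrative assumptions, not the chip's design values.

        def flash_sample(v, thresholds):
            # Each comparator fires if the input exceeds its reference level;
            # the output code is the count of fired comparators (thermometer code).
            return sum(v > t for t in thresholds)

        thresholds = [i / 14 for i in range(1, 15)]   # 14 evenly spaced references
        samples = [flash_sample(v, thresholds) for v in (0.03, 0.5, 0.97)]
        print(samples)  # -> [0, 6, 13]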